Computed Tomography scanner (CT scan) is a widely spread and popular exam in oncology: it reflects the density of the tissues of the human body. It is, then, adapted to the study of lung cancer because lungs are mostly filled with air (low density) while tumors are made of dense tissues.
Small Cell Lung Cancer can itself be split into four major subtypes based on histology observations: squamous cell carcinoma, large cell carcinoma, adenocarcinoma and a mixture of all
Predict the survival time of a patient (remaining days to live) from one three-dimensional CT scan (grayscale image) and a set of pre-extracted quantitative imaging features, as well as clinical data.
To each patient corresponds one CT scan, and one binary segmentation mask. The segmentation mask is a binary volume of the same size as the CT scan, except that it is composed of zeroes everywhere there is no tumour, and 1 otherwise. The CT scans and the associated segmentation masks are subsets of two public datasets:
Both training and validation contain for each patient, the time to event (days), as well as the censorship. Censorship indicates whether the event (death) was observed or whether the patient escaped the study: this can happen when the patient’s track was lost, or if the patient died of causes not related to the disease.
reticulate::use_python("/Users/Mezhoud/anaconda3/bin/python3", required = TRUE)
reticulate::py_config()## python: /Users/Mezhoud/anaconda3/bin/python3
## libpython: /Users/Mezhoud/anaconda3/lib/libpython3.7m.dylib
## pythonhome: /Users/Mezhoud/anaconda3:/Users/Mezhoud/anaconda3
## version: 3.7.5 (default, Oct 25 2019, 10:52:18) [Clang 4.0.1 (tags/RELEASE_401/final)]
## numpy: /Users/Mezhoud/anaconda3/lib/python3.7/site-packages/numpy
## numpy_version: 1.17.3
##
## NOTE: Python version was forced by use_python function
import numpy as np
from matplotlib import pyplot as plt
#from matplotlib import pyplot
from PIL import Image
img_array = np.load('train/images/patient_002.npz')
scan = img_array['scan']
mask = img_array['mask']
print("the dimension of scan array is: ", str(scan.shape))## the dimension of scan array is: (92, 92, 92)
## the dimension of mask array is: (92, 92, 92)
## plot some images from patient 002:
f, axarr = plt.subplots(2,3)
axarr[0,0].imshow(scan[1:92, 1:92, 0])
axarr[1,0].imshow(mask[1:92, 1:92, 0])
axarr[0,1].imshow(scan[:, :, 3])
axarr[1,1].imshow(mask[:, :, 3])
axarr[0,2].imshow(scan[:, :, 80])
axarr[1,2].imshow(mask[:, :, 80])
def plot_figures(figures, nrows = 1, ncols=1):
"""Plot a dictionary of figures.
Parameters
----------
figures : <title, figure> dictionary
ncols : number of columns of subplots wanted in the display
nrows : number of rows of subplots wanted in the figure
"""
fig, axeslist = plt.subplots(ncols=ncols, nrows=nrows)
for ind,title in zip(range(len(figures)), figures):
axeslist.ravel()[ind].imshow(figures[title], cmap=plt.jet())
axeslist.ravel()[ind].set_title(title)
axeslist.ravel()[ind].set_axis_off()
plt.tight_layout()
img_array = np.load('train/images/patient_002.npz')
scan = img_array['scan']
mask = img_array['mask']
# generation of a dictionary of (title, images)
number_of_im = 6
scan = {'scan'+str(i): scan[1:92, 1:92, i] for i in range(number_of_im)}
# plot of the images in a figure, with 5 rows and 4 columns
plot_figures(scan, 2, 3)
plt.show()The plot shows colored images scan of 6 slides. At this step it is not easy to distinguish the tumor.
The dataset has aslo the masks for each scan slide which locate the position of the tumor in the scan.
mask = {'mask'+str(i): mask[1:92, 1:92, i] for i in range(number_of_im)}
# plot of the images in a figure, with 5 rows and 4 columns
plot_figures(mask, 2, 3)
plt.show()At the end the size the Tumor is decreasing.
We can note that the crop is adjusted to the size of the tumor
img_array = np.load('train/images/patient_002.npz')
scan = img_array['scan']
mask = img_array['mask']
mask = {'mask'+str(i): mask[1:92, 1:92, i] for i in range(90)}
# plot of the images in a figure, with 5 rows and 4 columns
plot_figures(mask, 9, 10)
plt.show()If we compare with the scan slides, we obtain:
scan = {'scan'+str(i): scan[1:92, 1:92, i] for i in range(90)}
# plot of the images in a figure, with 5 rows and 4 columns
plot_figures(scan, 9, 10)
plt.show()It is always not easy to delimit the tumor in scan images
Comparing to masks, we can note that, between scan 34 and scan 65, the slides have more yellow stain or less blue color.
import numpy as np
from matplotlib import pyplot as plt
from PIL import Image
img_array = np.load('train/images/patient_002.npz')
scan = img_array['scan']
mask = img_array['mask']
background = mask[1:92, 1:92, 56]
overlay = scan[1:92, 1:92, 56]
plt.title("Scan/Mask: 56")
plt.imshow(background, cmap='gray')
plt.imshow(overlay, cmap='jet', alpha=0.9)img_array = np.load('test/images/patient_001.npz')
scan = img_array['scan']
mask = img_array['mask']
# generation of a dictionary of (title, images)
number_of_im = 90
scan = {'scan'+str(i): scan[1:92, 1:92, i] for i in range(number_of_im)}
# plot of the images in a figure, with 5 rows and 4 columns
plot_figures(scan, 9, 10)
plt.show()Plot mask slides from test dataset
mask = {'mask'+str(i): mask[1:92, 1:92, i] for i in range(number_of_im)}
# plot of the images in a figure, with 5 rows and 4 columns
plot_figures(mask, 9, 10)
plt.show()img_array = np.load('test/images/patient_001.npz')
scan = img_array['scan']
mask = img_array['mask']
background = mask[1:92, 1:92, 34]
overlay = scan[1:92, 1:92, 34]
plt.title("Scan/Mask: 34")
plt.imshow(background, cmap='gray')
plt.imshow(overlay, cmap='jet', alpha=0.9)In the test images, we can also observe tumor slides like in train dataset.
For training step, it maybe better to use masks slides than scan. But we need to explore variables in clinical data and radiomics and think how to associate images with numeric variables.
One think we can do is the convert slides to dataframe (each slide in one row) and then we can obtain one matrix for each patient tumor.
At this step I will switch from python to R :-)
The goal of this step is to convert image matrices as vector. So, each image can be ranged in one row. Finally, we can obtain one dataframe with 92 rows (images) ofr each sample (patient).
patient_002 <- reticulate::py$load_img_array('train/images/patient_002.npz')
paste0("One image is a: ", class(patient_002[[1]][,,1]))## [1] "One image is a: matrix"
## [1] "Two images are an: array"
## [1] "Print the first 10 pixels of Scan N°1: "
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] -777 -759 -707 -697 -749 -796 -826 -837 -858 -860
## [2,] -783 -791 -774 -768 -787 -808 -827 -826 -829 -829
## [3,] -804 -841 -827 -812 -820 -840 -831 -801 -792 -794
## [4,] -830 -857 -839 -816 -805 -818 -801 -764 -734 -722
## [5,] -844 -854 -843 -823 -799 -787 -771 -743 -704 -670
## [6,] -844 -849 -844 -831 -821 -810 -809 -791 -760 -719
## [7,] -848 -847 -848 -844 -847 -841 -849 -841 -829 -806
## [8,] -847 -856 -854 -846 -840 -813 -826 -849 -848 -836
## [9,] -840 -851 -836 -823 -835 -796 -803 -853 -857 -833
## [10,] -847 -841 -829 -817 -860 -832 -799 -842 -865 -838
## [1] "Print the first 10 pixels of Mask N°1: "
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
ls_scan_patient_002 <- lapply(seq(dim(patient_002[[1]])[3]), function(x) patient_002[[1]][ , , x])
ls_mask_patient_002 <- lapply(seq(dim(patient_002[[2]])[3]), function(x) patient_002[[1]][ , , x])
paste0("The dimension of the scan images is: ", length(ls_scan_patient_002))## [1] "The dimension of the scan images is: 92"
## [1] "The dimension of the mask images is: 92"
mat2vec <- function(path){
# Load patient CT scan
patient <- py$load_img_array(path)
# list scans
scan <- lapply(seq(dim(patient[[1]])[3]), function(x) patient[[1]][ , , x])
# list masks
mask <- lapply(seq(dim(patient[[2]])[3]), function(x) patient[[1]][ , , x])
# vectorize each matrix (image) into vector
vec_scan <- lapply(scan, function(x) as.vector(x))
# vectorise each mask (image) to vector
vec_mask <- lapply(mask, function(x) as.vector(x))
# bind vector into dataframe by row
df_scan <-as.data.frame( do.call(rbind, vec_scan))
df_mask <-as.data.frame( do.call(rbind, vec_mask))
# extract patien_id from path
scan_id <- paste0(tools::file_path_sans_ext(basename(path)), "_scan")
mask_id <- paste0(tools::file_path_sans_ext(basename(path)), "_mask")
# group in list the scan and the mask dataframes
ls <- list(df_scan, df_mask)
# Rename list
names(ls) <- c(scan_id, mask_id)
return(ls)
}
patient2 <- mat2vec('train/images/patient_002.npz')
paste0("The output is a: " ,class(patient2))## [1] "The output is a: list"
## [1] "With length of: 2"
## [1] "The names of two elements are: "
## [1] "patient_002_scan" "patient_002_mask"
## [1] "which are: data.frame"
## [1] "The dimension of each dataframe is: "
## [1] 92 8464
At this step we stop exploring scan and mask.
We think to use only masks for modeling
Potential method: keras, mxnet
radiomics <- fread("train/features/radiomics.csv", quote = "")
clinical <- fread("train/features/clinical_data.csv")
# display only 8 columns and 5 rows
head(radiomics)[,1:8]## V1 V2 V3
## 1: shape shape
## 2: original_shape_Compactness1 original_shape_Compactness2
## 3: PatientID
## 4: 202 0.027815034 0.274891585
## 5: 371 0.02301549 0.188210005
## 6: 246 0.027348106 0.265739895
## V4 V5
## 1: shape shape
## 2: original_shape_Maximum3DDiameter original_shape_SphericalDisproportion
## 3:
## 4: 48.55924217 1.537964054
## 5: 75.70336849 1.744961158
## 6: 70.43436661 1.555420243
## V6 V7
## 1: shape shape
## 2: original_shape_Sphericity original_shape_SurfaceArea
## 3:
## 4: 0.650210255 5431.33321
## 5: 0.573078659 10369.56873
## 6: 0.642913068 10558.81869
## V8
## 1: shape
## 2: original_shape_SurfaceVolumeRatio
## 3:
## 4: 0.275227763
## 5: 0.240726824
## 6: 0.200765988
## PatientID Histology Mstage Nstage SourceDataset Tstage age
## 1: 202 Adenocarcinoma 0 0 l2 2 66.0000
## 2: 371 large cell 0 2 l1 4 64.5722
## 3: 246 squamous cell carcinoma 0 3 l1 2 66.0452
## 4: 240 nos 0 2 l1 3 59.3566
## 5: 284 squamous cell carcinoma 0 3 l1 4 71.0554
## 6: 348 squamous cell carcinoma 0 2 l1 2 65.0212
The radiomics features can be divided into 4 groups as follows (shown in row 1): - Group 1. First order statistics - Group 2. Shape and size based features - Group 3. Textural features - Group 4. Wavelet features
Each group can be subset into several sub-groups shown in row 2 of the radiomics dataset. To make the radiomics features numeric dataset we need to remove the two first rows and convert them to colnames.
groups <- radiomics[1:2,-1] %>%
t() %>%
as.data.frame() %>%
rename("Groups" = V1, "Features" = V2) #%>%
# remove_rownames()
head(groups)## Groups Features
## V2 shape original_shape_Compactness1
## V3 shape original_shape_Compactness2
## V4 shape original_shape_Maximum3DDiameter
## V5 shape original_shape_SphericalDisproportion
## V6 shape original_shape_Sphericity
## V7 shape original_shape_SurfaceArea
To improve the esthetic of the dataframe, we note:
original is repetitive word. we can omit it.
The group label is included in Feature label except textural
We can remove original from Features and use use the rest as colnames of the radiomics dataset.
groups %>%
group_by(Groups, Features) %>%
summarise(n_Features = n()) %>%
ggplot() +
aes(x = Groups, y = n_Features, color = Features ) +
geom_col() +
theme(legend.position = "none") +
ggtitle("Number of Features by Group")new_colnames_radiomics <- groups %>%
mutate(Features = stringr::str_remove(Features,"original_")) %>%
pull(Features)
new_colnames_radiomics %>% head()## [1] "shape_Compactness1" "shape_Compactness2"
## [3] "shape_Maximum3DDiameter" "shape_SphericalDisproportion"
## [5] "shape_Sphericity" "shape_SurfaceArea"
old_names <- colnames(radiomics)
new_names <- c("PatientID", new_colnames_radiomics)
new_radiomics <- radiomics[-1:-3,] %>%
rename_at(vars(old_names), ~ new_names) %>%
mutate_if(is.character, as.numeric) #%>%
#as.matrix()
head(new_radiomics)[,1:8]## PatientID shape_Compactness1 shape_Compactness2 shape_Maximum3DDiameter
## 1 202 0.02781503 0.2748916 48.55924
## 2 371 0.02301549 0.1882100 75.70337
## 3 246 0.02734811 0.2657399 70.43437
## 4 240 0.02681111 0.2554064 46.81880
## 5 284 0.02369124 0.1994242 53.79591
## 6 348 0.03098136 0.3410383 63.74951
## shape_SphericalDisproportion shape_Sphericity shape_SurfaceArea
## 1 1.537964 0.6502103 5431.333
## 2 1.744961 0.5730787 10369.569
## 3 1.555420 0.6429131 10558.819
## 4 1.576120 0.6344693 4221.412
## 5 1.711620 0.5842418 5295.900
## 6 1.431305 0.6986630 8493.134
## shape_SurfaceVolumeRatio
## 1 0.2752278
## 2 0.2407268
## 3 0.2007660
## 4 0.3238780
## 5 0.3272407
## 6 0.1976017
M <- cor(new_radiomics[-1])
#corrplot(M, method = "circle")
corrplot.mixed(M, tl.col="black", tl.pos = "lt")